System and method for providing interactive audio in a multi-channel audio environment

ABSTRACT

DTS Interactive provides a low cost, fully interactive, immersive digital surround sound environment suitable for 3D gaming and other high fidelity audio applications, which can be configured to maintain compatibility with the existing infrastructure of Digital Surround Sound decoders. The component audio is stored and mixed in a compressed and simplified format that reduces memory requirements and processor utilization and increases the number of components that can be mixed without degrading audio quality. Techniques are also provided for “looping” compressed audio, which is an important and standard feature in gaming applications that manipulate PCM audio. In addition, decoder sync is ensured by transmitting frames of “silence” whenever mixed audio is not present either due to processing latency or the gaming application.

This Application is a divisional of U.S. Ser. No. 09/432,917 filed on 2 Nov. 1999 and claims priority of that application.

FIELD OF THE INVENTION

This invention relates to fully interactive audio systems and more specifically to a system and method of rendering real-time multi-channel interactive digital audio to create a rich immersive surround sound environment suitable for 3D gaming, virtual reality and other interactive audio applications.

BACKGROUND OF THE INVENTION

Recent developments in audio technology have focused on creating real-time interactive positioning of sounds anywhere in the three-dimensional space surrounding a listener (the “sound field”). True interactive audio provides not only the ability to create the sound on-demand but the ability to position that sound precisely in the sound field. Support for such technologies can be found in a variety of products but most frequently in software for video games to create a natural, immersive, and interactive audio environment. Applications extend beyond gaming into the entertainment world in the form of audiovisual products such as DVD, and also into video-conferencing, simulation systems and other interactive environments.

Advances in audio technology have proceeded in the direction of making the audio environment “real” to a listener. Monophonic recording and playback evolved into stereophonic recording and playback and, in turn, led to developments in quadraphonic recording and playback systems. Surround sound developments followed, first in the analog domain with the development of Virtual Surround and HRTFs, Dolby Surround and later in the digital domain with AC-3, MPEG, and DTS to immerse a listener in the surround sound environment.

Virtual Surround Sound systems use binaural technology and psychoacoustic cues to create the illusion of surround audio without the need for multiple speakers. The majority of these virtualized 3D audio technologies are based on the concept of HRTFs (Head-Related Transfer Functions). The original digitized sound is convolved in real-time with the left- and right-ear HRTFs corresponding to the desired spatial location, producing right- and left-ear binaural signals which, when heard, seem to come from the desired location. To position the sound, the HRTFs are changed to those for the desired new location and the process repeated. A listener can experience nearly free-field listening through headphones if the audio signals are filtered with that listener's own HRTFs. However, this is often impractical and experimenters have searched for a set of general HRTFs that have good performance for a wide range of listeners. This has been difficult to accomplish, with a specific obstacle being front-back confusion, which describes the sensation that sounds either in front of or behind the head are coming from the same direction. Despite its drawbacks, HRTF methods have been successfully applied to both PCM audio and, with much lessened computational load, to compressed MPEG audio. Although virtual surround technologies based on HRTFs provide significant benefits in situations where full home theater set-ups are not practical, these current solutions do not provide any means for interactive positioning of specific sounds.

The Dolby Surround system is another method to implement positional audio. Dolby Surround is a matrix process that enables a stereo (two-channel) medium to carry four-channel audio. The system takes four-channel audio and generates two channels of Dolby Surround encoded material identified as left total (Lt) and right total (Rt). The encoded material is decoded by a Dolby Pro-Logic decoder producing a four-channel output: a left channel, a right channel, a center channel and a mono surround channel. The center channel is designed to anchor voices at the screen. The left and right channels are intended for music and some sound effects, with the surround channel primarily dedicated to the sound effects. The surround sound tracks are pre-encoded in Dolby Surround format, and thus they are best suited for movies and are not particularly useful in interactive applications such as video games. PCM audio can be overlaid on the Dolby Surround audio to provide a less controllable interactive audio experience. Unfortunately, mixing PCM with Dolby Surround Sound is content dependent, and overlaying PCM audio on the Dolby Surround audio tends to confuse the Dolby Pro-Logic decoder, which can create undesirable surround artifacts and crosstalk.

To improve channel separation, digital surround sound technologies such as Dolby Digital and DTS provide six discrete channels of digital sound: left, center and right front speakers along with separate left surround and right surround rear speakers and a subwoofer. Digital surround is a pre-recorded technology and thus best suited for movies and home A/V systems where the decoding latency can be nulled, and in its present form it is not particularly useful for interactive applications such as video games. However, since Dolby Digital and DTS provide high fidelity positional audio, have a large installed base of home theater decoders, definitions for a multi-channel 5.1 speaker format and product available for market, they present a highly desirable multi-channel environment for PCs and in particular console based gaming systems if they could be made fully interactive.

Cambridge SoundWorks offers a hybrid digital surround/PCM approach in the form of the DeskTop Theater 5.1 DTT2500. This product features a built-in Dolby Digital decoder that combines pre-encoded Dolby Digital 5.1 background material with interactive four-channel digital PCM audio. This system requires two separate connectors; one to deliver the Dolby Digital and one to deliver the 4-channel digital audio. Although a step forward, DeskTop Theater is not compatible with the existing installed base of Dolby Digital decoders and requires sound cards supporting multiple channels of PCM output. The sounds are reproduced from speakers located at known locations, but the goal in an interactive 3D sound field is to create a believable environment in which sounds appear to originate from any chosen direction about the listener. The richness of the DeskTop Theater's interactive audio is further limited by the computation requirements needed to process the PCM data. Sideways localization, which is a critical component of a positional audio environment, is computationally expensive to apply on time-domain data, as are the operations of filtering and equalization.

The gaming industry needs a low cost, fully-interactive, low latency, immersive digital surround sound environment suitable for 3D gaming and other interactive audio applications that allows the gaming programmer to mix a large number of audio sources and to precisely position them in the sound field, and which is compatible with the existing infrastructure of home theater Digital Surround Sound systems.

SUMMARY OF THE INVENTION

In view of the above problems, the present invention provides a low cost fully interactive immersive digital surround sound environment suitable for 3D gaming and other high fidelity audio applications, which can be configured to maintain compatibility with the existing infrastructure of Digital Surround Sound decoders.

This is accomplished by storing each audio component in a compressed format that sacrifices coding and storage efficiency in favor of computational simplicity, mixing the components in the subband domain rather than the time domain, recompressing and packing the multi-channel mixed audio into the compressed format and passing it to a downstream surround sound processor for decoding and distribution. Techniques are also provided for “looping” compressed audio, which is an important and standard feature in gaming applications that manipulate PCM audio. In addition, decoder sync is ensured by transmitting frames of “silence” whenever mixed audio is not present either due to processing latency or the gaming application.

More specifically, the components are preferably encoded into a subband representation, compressed and packed into a data frame in which only the scale factors and subband data change from frame-to-frame. This compressed format requires significantly less memory than standard PCM audio but more than that required by variable length code storage such as used in Dolby AC-3 or MPEG. More significantly, this approach greatly simplifies the unpack/pack, mix and decompress/compress operations, thereby reducing processor utilization. In addition, fixed length codes (FLCs) aid the random access navigation through an encoded bitstream. High levels of throughput can be achieved by using a single predefined bit allocation table to encode the source audio and the mixed output channels. In the currently preferred embodiment, the audio renderer is hardcoded for a fixed header and bit allocation table so that the audio renderer only need process the scale factors and subband data.

Mixing is achieved by partially decoding (decompressing) only the subband data from components that are considered audible and mixing them in the subband domain. The subband representation lends itself to a simplified psychoacoustic masking technique so that a large number of sources can be rendered without increasing processing complexity or reducing the quality of the mixed signal. In addition, since multi-channel signals are encoded into their compressed format prior to transmission, a rich high-fidelity unified surround sound signal can be delivered to the decoder over a single connection.

These and other features and advantages of the invention will be apparent to those skilled in the art from the following detailed description of preferred embodiments, taken together with the accompanying drawings, in which:

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1 a through 1 c are block diagrams of different gaming configurations in accordance with the present invention;

FIG. 2 is a block diagram of the application layer structure for a fully interactive surround sound environment;

FIG. 3 is a flowchart of the audio rendering layer shown in FIG. 2;

FIG. 4 is a block diagram of the packing process for assembling and queuing up the output data frames for transmission to a surround sound decoder;

FIG. 5 is a flow chart illustrating the looping of compressed audio;

FIG. 6 is a diagram depicting the organization of the data frames;

FIG. 7 is a diagram depicting the organization of the quantized subband data, scale factors and bit allocation in each frame;

FIG. 8 is a block diagram of the subband domain mixing process;

FIG. 9 is a diagram that illustrates the psychoacoustic masking effects;

FIGS. 10 a through 10 c diagram the bit extraction process for packing and unpacking each frame; and

FIG. 11 is a diagram that illustrates the mixing of the specified subband data.

DETAILED DESCRIPTION OF THE INVENTION

DTS Interactive provides a low cost fully interactive immersive digital surround sound environment suitable for 3D gaming and other high fidelity audio applications. DTS Interactive stores the component audio in a compressed and packed format, mixes the source audio in the subband domain, recompresses and packs the multi-channel mixed audio into the compressed format and passes it to a downstream surround sound processor for decoding and distribution. DTS Interactive greatly increases the number of audio sources that can be rendered together in an immersive multi-channel environment without increasing the computational load or degrading the rendered audio. DTS Interactive simplifies equalization and phase positioning operations. In addition, techniques are provided for “looping” compressed audio, and decoder sync is ensured by transmitting frames of “silence” whenever source audio is not present, where silence includes true silence or low level noise. DTS Interactive is designed to maintain backward compatibility with the existing infrastructure of DTS Surround Sound decoders. However, the described formatting and mixing techniques could be used to design a dedicated gaming console that would not be limited to maintaining source and/or destination compatibility with the existing decoder.

DTS Interactive

The DTS Interactive system is supported by multiple platforms: the DTS 5.1 multi-channel home theatre system 10, which includes a decoder and an AV amplifier; a sound card 12 equipped with a hardware DTS decoder chipset with an AV amplifier 14; or a software implemented DTS decoder 16 with an audio card 18 and an AV amplifier 20, see FIGS. 1 a, 1 b and 1 c. All of these systems require a set of speakers named left 22, right 24, left surround 26, right surround 28, center 30 and sub-woofer 32, a multi-channel decoder and a multi-channel amplifier. The decoder provides a digital S/PDIF or other input for supplying compressed audio data. The amplifier powers six discrete speakers. Video is rendered on a display or projection device 34, usually a TV or other monitor. A user interacts with the AV environment through a human interface device (HID) such as a keyboard 36, mouse 38, position sensor, trackball or joystick.

Application Programming Interface (API)

As shown in FIGS. 2 and 3, the DTS Interactive system consists of three layers: the application 40, the application programming interface (API) 42 and the audio renderer 44. The software application could be a game or a music playback/composition program, which takes component audio files 46 and assigns to each some default positional character 48. The application also accepts interactive data from the user via an HID 36/38.

For each game level, frequently used audio components are loaded into memory (step 50). Because each component is treated as an object, the programmer is kept unaware of the sound format and rendering details; he need only be concerned with the position relative to the listener and the effects processing that might be desired. The DTS Interactive format allows these components to be mono, stereo or multi-channel with or without low frequency effects (LFE). Since DTS Interactive stores the components in a compressed format (see FIG. 6), valuable system memory is saved that can otherwise be used for higher resolution video rendering, more colors, or more textures. The reduced file size resulting from the compressed format also permits rapid on-demand loading from the storage media. The sound components are provisioned with parameters to detail the position, equalization, volume and necessary effects. These details will influence the outcome of the rendering process.

API layer 42 provides an interface for the programmer to create and control each sound effect and also provides isolation from the complicated real-time audio rendering process that deals with the mixing of the audio data. Object-oriented classes create and control the sound generation. There are several class members at the programmer's disposal, which are as follows: load, unload, play, pause, stop, looping, delay, volume, equalization, 3D position, maximum and minimum sound dimensions of the environment, memory allocation, memory locking and synchronization.
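By way of illustration only, such an interface might be sketched as follows; the class and member names here are assumptions of this sketch, not identifiers from the DTS Interactive API.

```cpp
#include <string>

struct Position3D { float x, y, z; };

// Illustrative sound-object interface exposing the class members listed above.
class SoundObject {
public:
    virtual ~SoundObject() = default;
    // Resource management
    virtual bool Load(const std::string& file) = 0;
    virtual void Unload() = 0;
    // Transport control
    virtual void Play() = 0;
    virtual void Pause() = 0;
    virtual void Stop() = 0;
    virtual void SetLooping(bool enable) = 0;
    virtual void SetDelay(unsigned frames) = 0;
    // Rendering parameters consumed by the mixer
    virtual void SetVolume(float gain) = 0;
    virtual void SetEqualization(const float bandGains[32]) = 0;
    virtual void SetPosition(const Position3D& pos) = 0;
};
```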

The API generates a record of all sound objects created and loaded into memory or accessed from media (step 52). This data is stored in an object list table. The object list does not contain the actual audio data but rather tracks information important to the generation of the sound, such as information to indicate the position of the data pointer within the compressed audio stream, the position coordinates of the sound, the bearing and distance to the listener's location, the status of the sound generation and any special processing requirements for mixing the data. When the API is called to create a sound object, a reference pointer to the object is automatically entered into the object list. When an object is deleted, the corresponding pointer entry in the object list is set to null. If the object list is full then a simple age based caching system can choose to overwrite old instances. The object list forms the bridge between the asynchronous application, the synchronous mixer and compressed audio generator processes.

The classes inherited by each object permit start, stop, pause, load and unload functions to control the generation of the sound. These controls allow the play list manager to examine the object list and construct a play list 53 of only those sounds that are actively playing at that moment in time. The manager can decide to omit a sound from the play list if it is paused, stopped, has completed playing or has not been delayed sufficiently to commence playing. Each entry in the play list is a pointer to individual frames within a sound that must be examined and if necessary piecewise unpacked prior to mixing. Since frame sizes are constant, manipulation of the pointer permits playback positioning, looping and delay of the output sound. This pointer value indicates the current decoding position within the compressed audio stream.
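A minimal sketch of the play list construction just described, with assumed names (SoundEntry, PlayState) standing in for the actual object list entries:

```cpp
#include <vector>

enum class PlayState { Playing, Paused, Stopped, Finished };

struct SoundEntry {
    PlayState state;
    int delayFramesRemaining;       // frames left before playback begins
    const unsigned char* framePtr;  // current decode position in the compressed stream
};

// Build the play list of only those sounds that are actively playing.
std::vector<const SoundEntry*> BuildPlayList(const std::vector<SoundEntry*>& objectList) {
    std::vector<const SoundEntry*> playList;
    for (const SoundEntry* obj : objectList) {
        if (!obj) continue;                           // deleted objects leave null entries
        if (obj->state != PlayState::Playing) continue;
        if (obj->delayFramesRemaining > 0) continue;  // delay not yet elapsed
        playList.push_back(obj);
    }
    return playList;
}
```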

The positional localization of sounds requires the assignment of sounds to individual rendering pipelines or execute buffers that in turn map directly onto the arrangement of the loudspeakers (step 54). This is the purpose of the mapping function. Position data for entries in the frame list are examined to determine which signal processing functions to apply, renew the bearings and direction of each sound to the listener, alter each sound depending on physical models for the environment, determine mixing coefficients and allocate audio streams to the available and most appropriate speakers. All parameters and model data are combined to deduce modifications to the scale factors associated with each compressed audio frame entering a pipeline. If side localization is desired, data from the phase shift tables are indicated and indexed.

Audio Rendering

As shown in FIGS. 2 and 3, audio rendering layer 44 is responsible for mixing the desired subband data 55 according to the 3D parameters 57 set by the object classes. The mixing of multiple audio components requires the selective unpacking and decompression of each component, summing of correlated samples and the calculation of a new scale factor for each subband. All processes in the rendering layer must function in real-time to deliver a smooth and continuous flow of compressed audio data to the decoding system. A pipeline receives a listing of the sound objects in play and, from within each object, directions for the modification of the sound. Each pipeline is designed to manipulate the component audio according to the mixing coefficients and to mix an output stream for a single speaker channel. The channels are packed and multiplexed into a unified output bitstream.

More specifically, the rendering process commences by unpacking and decompressing each component's scale factors into memory on a frame-by-frame basis (step 56), or alternately multiple frames at a time (see FIG. 7). At this stage only the scale factor information for each subband is required to assess if that component, or portions of the component, will be audible in the rendered stream. Since fixed length coding is used, it is possible to unpack and decompress only that part of the frame that contains the scale factors, thereby reducing processor utilization. For SIMD performance reasons each 7-bit scale factor value is stored as a byte in memory space, and aligned to a 32-byte address boundary to ensure that a cache line read will obtain all scale factors in one cache fill operation and not cause cache memory pollution. To further speed this operation, the scale factors may be stored as bytes in the source material and organized to occur in memory on 32-byte address boundaries.
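For illustration, the byte-per-scale-factor, cache-aligned layout might look like the following; the struct name is an assumption:

```cpp
#include <cstdint>

// One 7-bit scale factor per subband, widened to a byte and aligned so that
// all 32 values land in a single cache line read.
struct alignas(32) ScaleFactorBlock {
    uint8_t sf[32];  // scale factor index for each of the 32 subbands
};
```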

The 3D parameters 57 provided by the 3D position, volume, mixing and equalization are combined to determine a modification array for each subband that is used to modify the extracted scale factors (step 58). Because each component is represented in the subband domain, equalization is a trivial operation of adjusting the subband coefficients as desired via the scale factors.
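Because equalization collapses to a per-subband scale factor adjustment, a minimal sketch (names assumed) is:

```cpp
// Equalization in the subband domain reduces to one multiply per subband:
// scale each band's scale factor by the desired EQ gain.
void ApplyEq(float sf[32], const float eqGain[32]) {
    for (int b = 0; b < 32; ++b)
        sf[b] *= eqGain[b];
}
```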

In step 60, the maximum scale factors indexed for all elements in the pipeline are located and stored to an output array, which is suitably aligned in memory space. This information is used to decide the need to mix certain subband components.

At this point, step 62, masking comparisons are made with the other pipelined sound objects to remove the inaudible subbands from the speaker pipelines (see FIGS. 8 and 9 for details). The masking comparisons are preferably done for each subband independently to improve speed and are based upon the scale factors for the objects referenced by the list. A pipeline contains only that information which is audible from a single speaker. The advantage of DTS Interactive over manipulation of PCM time-domain audio is that the gaming programmer is allowed to use many more components and rely on the masking routine to extract and mix only the audible sounds at any given time without excess computations.

Once the desired subbands are identified, the audio frames are further unpacked and decompressed to extract only the audible subband data (step 64), which is stored in left shifted DWORD format in memory (see FIGS. 10 a-10 c). Throughout the description the DWORD is assumed without loss of generality to be 32 bits. In the gaming environment, the price paid in lost compression for using FLCs is more than compensated by the reduction in the number of computations required to unpack and decompress the subband data. This process is further simplified by using a single predefined bit allocation table for all of the components and channels. FLCs enable random positioning of the read position at any subband within the component.

In step 66, phase positioning filtering is applied to the subband data for bands 1 and 2. The filter has specific phase characteristics and need only be applied over the frequency range 200 Hz to 1200 Hz where the ear is most sensitive to positional cues. Since phase position calculations are only applied to the first two bands of the 32 subbands, the number of computations is approximately one-sixteenth the number required for an equivalent time-domain operation. The phase modification can be ignored if sideways localization is not a necessity or if the computational overhead is viewed as excessive.

In step 68, subband data is mixed by multiplying it by the corresponding modified scale factor data and summing it with the scaled subband products of the other eligible subband components in the pipeline (see FIG. 11). The normal multiplication by step-size, which is dictated by the bit allocation, is avoided by predefining the bit allocation table to be the same for all audio components. The maximum scale factor indexes are looked up and divided into (or multiplied by the inverse of) the mixed result. The division and multiplication-by-inverse operations are mathematically equivalent but the multiplication operation is an order of magnitude faster. Overflow can occur when the mixed result exceeds the value stored in one DWORD. Attempting to store a floating-point word as an integer creates an exception which is trapped and used to correct the scale factor applied to the affected subband. After the mixing process, data is stored in left shifted form by numeric modification of the scale factor data.

Assembly and Queuing of Output Data Frames

As shown in FIG. 4, a controller 70 assembles output frames 72 and places them in a queue for transmission to a surround sound decoder. A decoder will only produce useful output if it can align to the repeating synchronization markers or sync codes embedded within the data stream. The transmission of coded digital audio via a S/PDIF data stream is an amendment of the original IEC958 specification and does not make provision for the identity of the coded audio format. The multiformat decoder must first determine the data format by reliably detecting concurrent sync words and then establish an appropriate decoding method. A loss of sync condition leads to an intermission in the audio reproduction as the decoder mutes its output signal and seeks to re-establish the coded audio format.

Controller 70 prepares a null output template 74 that includes compressed audio representing “silence”. In the currently preferred embodiment, there are no differences in the header information from frame to frame and only the scale factors and subband data regions need to be updated. The template header carries unchanging information regarding the format of the stream, the bit allocation and the side information for decoding and unpacking. The completed null output template 74 is queued in the sound card buffer (step 76).

Concurrently, the audio renderer is generating the list of sound objects, mapping them to the speaker locations and mixing the audible subband data as described above. The multi-channel subband data generated by the pipelines 82 is compressed (step 78) into FLCs 80 in accordance with the predefined bit allocation table, which are organized in parallel, each specific to a particular speaker channel. If a modified ITU speaker arrangement is adopted then the left surround and right surround channels are delayed 84 by a whole number of compressed audio frames. A packer 86 packs the scale factor and subband data (step 88) and submits the packed data to controller 70. The possibility of frame overflow is eliminated as the bit allocation tables for each channel in the output stream are predefined. The DTS Interactive format is not bit-rate limited and the simpler and more rapid encoding techniques of linear and block encoding can be applied.

To maintain decoder sync, controller 70 determines whether the next frame of packed data is ready for output (step 92). If the answer is yes, controller 70 writes the packed data (scale factors and subband data) over the previous output frame 72 (step 94) and puts it in the queue (step 96). If the answer is no, controller 70 outputs null output template 74. Sending compressed silence in this manner guarantees the interruption-free output of frames to the decoder to maintain sync.
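A minimal sketch of this ready-or-silence decision, with assumed types and names standing in for the controller of FIG. 4:

```cpp
#include <cstdint>
#include <vector>

// Assumed minimal types, for illustration only.
struct Frame { std::vector<uint8_t> bytes; };

struct Controller {
    Frame nullTemplate;  // precomputed frame of compressed "silence"
    Frame outputFrame;   // reusable output frame (header/bit allocation fixed)
    bool  packedReady;   // did the packer finish in time for this frame slot?
};

void QueueForOutput(const Frame& f);  // posts a frame to the sound card queue

// Per-frame decision: send real audio if it is ready, otherwise send
// compressed silence so the downstream decoder never loses sync.
void EmitNextFrame(Controller& ctl) {
    QueueForOutput(ctl.packedReady ? ctl.outputFrame : ctl.nullTemplate);
}
```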

In other words, controller 70 provides a data pump process whose function is to manage the coded audio frame buffers for seamless generation by the output device and without introducing intermissions or gaps in the output stream. The data pump process queues the audio buffer that has most recently completed output. When a buffer finishes output it is reposted back to the output buffer queue and flagged as empty. This empty state flag permits a mixing process to identify and copy data into that unused buffer while the next buffer in the queue is output and the remaining buffers wait for output. To prime the data pump process the queue list must first be populated with null audio buffer events. The content of the initialization buffers, whether coded or not, should represent silence or another inaudible or intended signal. The number of buffers in the queue and the size of each buffer influence the response time to user input. To keep latency low and provide a more realistic interactive experience the output queue is restricted to two buffers in depth while the size of each buffer is determined by the maximum frame size permitted by the destination decoder and by acceptable user latency.

Audio quality may be traded off against user latency. Small frame sizes are burdened by the repeat transmission of header information, which reduces the number of bits available to code audio data, thereby degrading the rendered audio, while large frame sizes are limited by the availability of local DSP memory in the home theater decoder, thereby increasing user latency. Combined with the sample rate, these two quantities determine the maximum refresh interval for updating the compressed audio output buffers. In the DTS Interactive system this is the time-base that is used to refresh the localization of sounds and provide the illusion of real-time interactivity. In this system the output frame size is set to 4096 bytes, offering a minimum header size, good time resolution for editing and loop creation and low latency to user responses. At each frame time the distance and angle of an active sound relative to the listener's position is calculated and this information used to render individual sounds. As an example, refresh rates of between 31 Hz and 47 Hz, depending on sample rate, are possible for a frame size of 4096 bytes.
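The quoted refresh rates follow directly from the sample rate and the number of samples carried per frame; a worked sketch, assuming each 4096-byte frame encodes 1024 samples per channel (the frame duration is stated elsewhere to be a multiple of 1024 samples):

```cpp
// Refresh rate = sample rate / samples per frame. Under the 1024-sample
// assumption this reproduces the quoted range:
//   48000 Hz / 1024 = 46.9 Hz   (~47 Hz refresh)
//   44100 Hz / 1024 = 43.1 Hz
//   32000 Hz / 1024 = 31.25 Hz  (~31 Hz refresh)
double RefreshRateHz(double sampleRate, int samplesPerFrame = 1024) {
    return sampleRate / samplesPerFrame;
}
```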

Looping Compressed Audio

Looping is a standard gaming technique in which the same sound bits are looped indefinitely to create a desired audio effect. For example, a small number of frames of a helicopter sound can be stored and looped to produce a helicopter sound for as long as the game requires. In the time domain, no audible clicking or distortion will be heard during the transition zone between the ending and the starting positions of the sound if the amplitudes of the beginning and end are complementary. This same technique does not work in the compressed audio domain.

Compressed audio is contained in packets of data encoded from fixed frames of PCM samples and is further complicated by the inter-dependence of compressed audio frames on previously processed audio. The reconstruction filters in the DTS surround sound decoder delay the output audio such that the first audio samples will exhibit a low level transient behavior due to the properties of the reconstruction filter.

As shown in FIG. 5, the looping solution implemented in the DTS Interactive system is done off-line to prepare component audio for storage in a compressed format that is compatible with real-time looping execution in the interactive gaming environment. The first step of the looping solution requires the PCM data of a looped sequence to be first compacted or dilated in time to fit precisely within the boundaries defined by a whole number of compressed audio frames (step 100). Encoded data is representative of a fixed number of audio samples from each encoded frame. In the DTS system the sample duration is a multiple of 1024 samples. To begin, at least N frames of uncompressed ‘lead-out’ audio are read out from the end of the file (step 102) and temporally appended to the start of the looped segment (step 104). In this example N has value 1 but any value sufficiently large to cover the reconstruction filter's dependency on previous frames may be used. After encoding (step 106), N compressed frames are deleted from the beginning of the encoded bit-stream to yield a compressed audio loop sequence (step 108). This process ensures that the values resident in the reconstruction synthesis filter during the closing frames are in agreement with the values necessary to ensure seamless concatenation with the commencing frame, and in doing so prevents audible clicking or distortion. On looped playback the read pointers are directed back to the start of the looped sequence for glitch-free playback.
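A sketch of this off-line preparation, assuming step 100 (time compaction or dilation) has already been applied and using an assumed EncodeFrames stand-in for the DTS encoder; the fixed frameBytes size is also an assumption of the sketch:

```cpp
#include <cstdint>
#include <vector>

std::vector<uint8_t> EncodeFrames(const std::vector<int16_t>& pcm);  // encoder stand-in

std::vector<uint8_t> PrepareLoop(const std::vector<int16_t>& pcm,  // whole number of frames
                                 int samplesPerFrame,              // e.g. multiple of 1024
                                 int frameBytes,                   // size of one coded frame
                                 int N = 1)                        // lead-out frame count
{
    // Steps 102-104: prepend N frames of 'lead-out' audio taken from the end
    // of the file to the start of the looped segment.
    std::vector<int16_t> padded(pcm.end() - N * samplesPerFrame, pcm.end());
    padded.insert(padded.end(), pcm.begin(), pcm.end());

    // Step 106: encode; step 108: drop the first N compressed frames so the
    // synthesis filter state at the loop end matches the loop start.
    std::vector<uint8_t> encoded = EncodeFrames(padded);
    encoded.erase(encoded.begin(), encoded.begin() + N * frameBytes);
    return encoded;
}
```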

DTS Interactive Frame Format

A DTS Interactive frame 72 consists of data arranged as shown in FIG. 6. The header 110 describes the format of the content, the number of subbands, the channel format, sampling frequency and tables (defined in the DTS standard) required to decode the audio payload. This region also contains a sync word to identify the start of the header and provide alignment of the encoded stream for unpacking.

Following the header, bit allocation section 112 identifies which subbands are present in a frame, together with an indication of how many bits are allocated per subband sample. A zero entry in the bit allocation table indicates that the related subband is not present in the frame. The bit allocation is fixed from component to component, channel-to-channel, frame-to-frame and for each subband for mixing speed. A fixed bit allocation is adopted by the DTS Interactive system and removes the need to examine, store and manipulate bit allocation tables and eliminates the constant checking of bit width during the unpacking phase. For example, the following bit allocation is suitable for use: {15, 10, 9, 8, 8, 8, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}.
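Expressed as a constant table (the symbol name is an assumption of this sketch), the fixed allocation removes any per-frame bit-width bookkeeping:

```cpp
// The example bit allocation above as a compile-time constant; with a single
// predefined table the unpacker never has to re-read bit widths per frame.
static const int kBitAllocation[] = {
    15, 10, 9, 8, 8, 8, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5,
     5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
};
```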

The scale factor section 114 identifies the scale factor for each of the subbands, e.g. 32 subbands. The scale factor data varies from frame-to-frame with the corresponding subband data.

Lastly, the subband data section 116 includes all of the quantized subband data. As shown in FIG. 7, each frame of subband data consists of 32 samples per subband organized as four vectors 118 a-118 d of size eight. Subband samples can be represented by linear codes or by block codes. Linear codes begin with a sign bit followed by the sample data while block codes are efficiently coded groups of subband samples inclusive of sign. The alignment of the bit allocation 112 and scale factors 114 with subband data 116 is also depicted.

Subband-Domain Mixing of Compressed Audio

As described previously, DTS Interactive mixes the component audio in a compressed format, e.g. subband data, rather than the typical PCM format and thus realizes tremendous computational, flexibility and fidelity benefits. These benefits are obtained by discarding those subbands that are inaudible to the user in two stages. First, the gaming programmer can, based on a priori information about the frequency content of a specific audio component, discard the upper (high frequency) subbands that contain little or no useful information. This is done off-line by setting the upper band bit allocations to zero before the component audio is stored (step 120).

More specifically, sample rates of 48.0 kHz, 44.1 kHz and 32.0 kHz are frequently encountered in audio, and the higher sample rates offer high fidelity full bandwidth audio at the cost of memory. This can be wasteful of resources if the material contains little high frequency content, such as voice. Lower sample rates may be more appropriate for some material but the problems of mixing differing sample rates arise. Game audio frequently uses the 22.050 kHz sampling rate as a good compromise between audio quality and memory requirements. In the DTS Interactive system all material is encoded at the highest supported sample rate mentioned earlier and material that does not fully occupy the full audio spectrum is treated as follows. Material intended for encoding at, say, 11.025 kHz is sampled at 44.1 kHz and the upper 75% of subbands describing the high frequency content are discarded. The result is an encoded file that retains compatibility and ease of mixing with other higher fidelity signals and yet allows a reduced file size. It is easy to see how this principle can be extended to enable 22.050 kHz sampling by discarding the upper 50% of subbands.

Second, DTS Interactive unpacks the scale factors (step 122) and uses them in a simplified psychoacoustic analysis (see FIG. 9) to determine which of the audio components on the list are audible in each subband (step 124). A standard psychoacoustic analysis that takes into account neighboring subbands could be implemented to achieve marginally better performance but would sacrifice speed. Thereafter, the audio renderer unpacks and decompresses only those subbands that are audible (step 126). The renderer mixes the subband data for each subband in the subband domain (step 128), recompresses it and provides it to the pipelines as detailed in FIG. 4 for packing.

The computational benefits of this process are realized from having to unpack, decompress, mix, recompress and pack only those subbands that are audible. Similarly, because the mixing process automatically discards all of the inaudible data, the gaming programmer is provided greater flexibility to create richer sound environments with a larger number of audio components without raising the quantization noise floor. These are very significant advantages in a real-time interactive environment where user latency is critical and a rich high fidelity immersive audio environment is the goal.

Psychoacoustic Masking Effects

Psychoacoustic measurements are used to determine perceptually irrelevant information, which is defined as those parts of the audio signal which cannot be heard by human listeners, and can be measured in the time domain, the subband domain, or in some other basis. Two main factors influence the psychoacoustic measurement. One is the frequency dependent absolute threshold of hearing applicable to humans. The other is the masking effect that one sound has on the ability of humans to hear a second sound played simultaneously or even after the first sound. In other words, the first sound, in the same or a neighboring subband, prevents us from hearing the second sound, and is said to mask it out.

In a subband coder the final outcome of a psychoacoustic calculation is a set of numbers which specify the inaudible level of noise for each subband at that instant. This computation is well known and is incorporated in the MPEG 1 compression standard ISO/IEC DIS 11172 “Information technology—Coding of moving pictures and associated audio for digital storage media up to about 1.5 Mbits/s,” 1992. These numbers vary dynamically with the audio signal. The coder attempts to adjust the quantization noise floor in the subbands by way of the bit allocation process so that the quantization noise in these subbands is less than the audible level.

DTS Interactive currently simplifies the normal psychoacoustic masking operation by disabling the inter-subband dependence. In the final analysis, the calculation of the intra-subband masking effects from the scale factors will provide the three or four audible components in each subband, which may or may not be the same from subband to subband. A full psychoacoustic analysis may provide more components in certain subbands and completely discard other subbands, most likely the upper subbands.

As shown in FIG. 9, the psychoacoustic masking function examines the object list and extracts the maximum scale value for each subband of the supplied component streams (step 130). This information is input to the masking function as a reference for the loudest signal that is present in the object list. The maximum scale factors are also directed to the quantizer as the basis for encoding the mixed results into the DTS compressed audio format.

For DTS-domain filtering, the time-domain signal is not available, so masking thresholds are estimated from the subband samples in the DTS signal. A masking threshold is calculated for each subband (step 132) from the maximum scale factor and the human auditory response. The scale factor for each subband is compared to the masking threshold for that band (step 136) and if found to be below the masking threshold set for that band then the subband is considered to be inaudible and removed from the mixing process (step 138); otherwise the subband is deemed to be audible and is kept for the mixing process (step 140). The current process only considers masking effects in the same subband and ignores the effects of neighboring subbands. Although this reduces performance somewhat, the process is much simpler, hence much faster, as required in an interactive real-time environment.
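A sketch of this same-subband masking test follows; the simple offset threshold is a placeholder assumption standing in for the human auditory response term, not the DTS model, and all names are illustrative.

```cpp
#include <algorithm>
#include <vector>

// Step 132 (threshold from the loudest signal, same band only) and step 136
// (compare each component's scale factor against the threshold).
bool IsAudible(float scaleFactor, float maxScaleFactor, float maskingOffset) {
    float threshold = maxScaleFactor - maskingOffset;
    return scaleFactor >= threshold;
}

// Keep, per subband, only components whose scale factor clears the threshold
// set by the loudest component in that subband (step 130 finds the maximum;
// steps 138/140 remove or keep each component).
void CullInaudible(const std::vector<std::vector<float>>& sf,  // [component][band]
                   std::vector<std::vector<bool>>& keep,
                   float maskingOffset) {
    const size_t nBands = sf.front().size();
    for (size_t band = 0; band < nBands; ++band) {
        float maxSf = 0.0f;
        for (const auto& comp : sf)
            maxSf = std::max(maxSf, comp[band]);
        for (size_t c = 0; c < sf.size(); ++c)
            keep[c][band] = IsAudible(sf[c][band], maxSf, maskingOffset);
    }
}
```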

Bit Manipulation

As discussed above, DTS Interactive is designed to reduce the number of computations required to mix and render the audio signal. Significant effort is expended to minimize the quantity of data that must be unpacked and repacked because these and the decompress/recompress operations are computationally intensive. Still, the audible subband data must be unpacked, decompressed, mixed, compressed and repacked. Therefore, DTS Interactive also provides a different approach for manipulating the data to reduce the number of computations to unpack and pack the data, as shown in FIGS. 10 a-10 c, and to mix the subband data, as shown in FIG. 11.

Digital Surround systems typically encode the bit stream using variable length bit fields to optimize compression. An important element of the unpacking process is the signed extraction of the variable length bit fields. The unpacking procedure is intensive due to the frequency of executing this routine. For example, to extract an N-bit field, 32-bit (DWORD) data is first shifted to the left to locate the sign bit in the left most bit field. Next, the value is divided by powers of two or right shifted by (32-N) bit positions to introduce the sign extension. The large number of shifting operations takes a finite time to execute and unfortunately cannot be executed in parallel or pipelined with other instructions on the present generation of Pentium processors.

DTS Interactive takes advantage of the fact that the scale factor is related to the bit width size and realizes that this provides the possibility to ignore the final right shifting operation if a) in its place the scale factors are treated accordingly and b) the number of bits that represent the subband data is sufficient that the “noise” represented by the (32-N) right most bits is below the noise floor of the reconstructed signal. Although N may be only a few bits, this typically only occurs in the upper subbands where the noise floor is higher. In VLC systems that apply very high compression ratios the noise floor could be exceeded.

As shown in FIG. 10 a, a typical frame will include a section of subband data 142, which includes each piece of N-bit subband data 142 where N is allowed to vary across the subbands but not the samples. As shown in FIG. 10 b, the audio renderer extracts the section of subband data and stores it in local memory, typically as 32-bit words 144 where the first bit is the sign bit 146 and the next thirty-one bits are data bits.

As shown in FIG. 10 c, the audio renderer has shifted subband data 142 to the left so that its sign bit is aligned with sign bit 146. Since all of the data is stored as FLCs rather than VLCs this is a trivial operation. The audio renderer does NOT right shift the data. Instead, the scale factors are prescaled by dividing them by 2 raised to the power of (32-N) and stored, and the (32-N) rightmost bits 148 are treated as inaudible noise. In other words, a one bit left shift of the subband data combined with a one bit right shift of the scale factor does not alter the value of the product. The same technique can also be utilized by the decoder.
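A sketch of the technique, under the stated assumption that the discarded (32-N) low bits fall below the noise floor; function names are illustrative:

```cpp
#include <cmath>
#include <cstdint>

// A conventional signed extraction would left-shift then arithmetic-right-
// shift by (32 - N); here the right shift is skipped and the scale factor is
// prescaled instead, so the product of sample and scale factor is unchanged.
int32_t UnpackLeftShifted(uint32_t raw, int N) {  // 1 <= N <= 32
    // Align the field's sign bit with bit 31 and stop; the low (32 - N) bits
    // carry only sub-noise-floor "garbage".
    return static_cast<int32_t>(raw << (32 - N));
}

float PrescaleScaleFactor(float scaleFactor, int N) {
    return std::ldexp(scaleFactor, -(32 - N));    // divide by 2^(32 - N)
}
```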

After summation of all mixing products and quantization it is a simple matter to identify those values that will overflow since the storage limit is fixed. This offers greatly superior detection speed in comparison to a system where the subband data has not been treated by the left shift operation.

When the data is repacked, the audio renderer simply grabs the first N bits from each 32-bit word, thereby avoiding (32-N) left shift operations. The avoidance of the (32-N) right and left shift operations may seem rather insignificant but the frequency of executing the unpack and pack routines is so high that it represents a significant reduction in computations.

Mixing Subband Data

As shown in FIG. 11, the mixing process commences and the audible subband data is multiplied by the corresponding scale factor, which has been adjusted for position, equalization, phase localization etc. (step 150), and the product is added to the corresponding subband products of the other eligible items in the pipeline (step 152). Since the number of bits for each component in a given subband is the same, the step size factors can be ignored thus saving computations. The maximum scale factor indexes are looked up (step 154) and the inverse is multiplied by the mixed result (step 156).

Overflow can occur when the mixed result exceeds the value stored in one DWORD (step 158). Attempting to store a floating point word as an integer creates an exception which is trapped and used to correct the scale factor applied to all affected subbands. If the exception occurs, the maximum scale factor is incremented (step 160) and the subband data is recalculated (step 156). The maximum scale factors are used as a starting point because it is better to err on the conservative side and have to increment the scale factor rather than reduce the dynamic range of the signal. After the mixing process, data is stored in left shifted form by modification of the scale factor data for recompression and packing.
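A sketch of this quantize-and-retry loop; the real implementation relies on a trapped float-to-int exception, whereas this sketch checks the range explicitly, and the doubling of the scale factor is an illustrative stand-in for incrementing its index:

```cpp
#include <cstdint>
#include <vector>

void QuantizeMix(const std::vector<float>& mixed,  // summed subband products
                 float& maxScaleFactor,            // step 154 starting point
                 std::vector<int32_t>& out) {
    out.resize(mixed.size());
    for (;;) {
        bool overflow = false;
        const float inv = 1.0f / maxScaleFactor;   // multiply by inverse (step 156)
        for (size_t i = 0; i < mixed.size(); ++i) {
            float q = mixed[i] * inv;
            if (q > 2147483520.0f || q < -2147483520.0f) {  // step 158: DWORD overflow
                overflow = true;
                break;
            }
            out[i] = static_cast<int32_t>(q);
        }
        if (!overflow) return;
        maxScaleFactor *= 2.0f;  // step 160: raise the scale factor, recalculate
    }
}
```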

While several illustrative embodiments of the invention have been shown and described, numerous variations and alternate embodiments will occur to those skilled in the art. For example, two 5.1 channel signals could be mixed and interleaved together to produce a 10.2 channel signal for true 3D immersion with the added dimension of height. In addition, instead of processing one frame at a time, the audio renderer could reduce the frame size by one-half and process two frames. This would reduce latency by about one-half at the cost of wasting some bits on repeating the header twice as often. However, in a dedicated system much of the header information could be eliminated. Such variations and alternate embodiments are contemplated, and can be made without departing from the spirit and scope of the invention as defined in the appended claims.

1. A method of preparing PCM audio data for storage in a compressed format that is compatible with looping, wherein said PCM audio data is stored in a file and the compressed format includes a sequence of compressed audio frames comprising: a. Compacting or dilating the PCM audio data in time to fit the boundaries defined by a whole number of compressed audio frames to form a looped segment; b. Appending N frames of PCM audio data from the end of the file to the start of the looped segment; c. Encoding the looped segment into a bitstream; and d. Deleting N compressed frames from the beginning of the encoded bitstream to yield a compressed audio loop sequence in which the compressed audio data in the closing frames of the loop sequence ensure seamless concatenation with the commencing frames during looping. 